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Abstract 

This paper introduces a novel message-passing (MP) framework for the collaborative filtering (CF) 
problem associated with recommender systems. We model the movie-rating prediction problem popu- 
larized by the Netflix Prize, using a probabilistic factor graph model and study the model by deriving 
generalization error bounds in terms of the training error. Based on the model, we develop a new MP 
algorithm, termed IMP, for learning the model. To show superiority of the IMP algorithm, we compare 
it with the closely related expectation-maximization (EM) based algorithm and a number of other matrix 
completion algorithms. Our simulation results on Netflix data show that, while the methods perform 
similarly with large amounts of data, the IMP algorithm is superior for small amounts of data. This 
improves the cold-start problem of the CF systems in practice. Another advantage of the IMP algorithm 
is that it can be analyzed using the technique of density evolution (DE) that was originally developed for 
MP decoding of error-correcting codes. 

Index Terms 

Belief propagation, message-passing; factor graph model; collaborative filtering, recommender sys- 
tems. 

I. Introduction 

One compelling application of collaborative filtering is the automatic generation of recommendations. 
For example, the Netflix Prize [9] has increased the interest in this field dramatically. Recommendation 
systems analyze, in essence, patterns of user interest in items to provide personalized recommendations 
of items that might suit a user's taste. Their ability to characterize and recommend items within huge 
collections has been steadily increasing and now represents a computerized alternative to human recom- 
mendations. In the collaborative filtering, the recommender system would identify users who share the 
same preferences (e.g. rating patterns) with the active user, and propose items which the like-minded users 
favored (and the active user has not yet seen). One difficult part of building a recommendation system 
is accurately predicting the preference of a user, for a given item, based only on a few known ratings. 
The collaborative filtering problem is now being studied by a broad research community including groups 
interested in statistics, machine learning and information theory [5], [4]. Recent works on the collaborative 
filtering problem can be largely divided into two areas: 

1) The first area considers efficient models and practical algorithms. There are two primary approaches: 
neighborhood model approaches that are loosely based on "/c-Nearest Neighbor" algorithms and 
factor models (e.g., low dimension or low-rank models with a least squares flavor) such as hard 
clustering based on singular vector decomposition (SVD) or probabilistic matrix factorization (PMF) 
and soft clustering which employs expectation maximization (EM) frameworks [5], [14], [15], [3], 
[16]. 

2) The second area involves exploration of the fundamental limits of these systems. Prior work has 
developed some precise relationships between sparse observation models and the recovery of missing 
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entries in terms of the matrix completion problem under the restriction of low-rank matrices model 
or clustering models [6], [7], [13]. This area is closely related with the practical issues known as 
cold-start problem [20], [9]. That is, giving recommendations to new users who have submitted 
only a few ratings, or recommending new items that have received only a few ratings from users. 
In other words, how few ratings to be provided for the the system to guess the preferences and 
generate recommendations? 
In this paper, we employ an alternative modern coding-theoretic approach that have been very successful 
in the field of error-correcting codes to the problem. Our results are different from the above works in 
several aspects as outlined below. 

1) Our approach tries to combine the benefits of clustering users and movies into groups proba- 
bilistically and applying a factor analysis to make predictions based on the groups. The precise 
probabilistic generative factor graph model is stated and generalization error bounds of the model 
with some observations are studied in Sec. II. Based on the model, we derive a MP based algorithms, 
termed IMP, which has demonstrated empirical success in other applications: low-density parity- 
check codes and turbo-codes decoding. Furthermore, as a benchmark, popular EM algorithms which 
are frequently used in both learning and coding community [3], [11], [2] are developed in Sec. III. 

2) Our goal is to characterize system limits via modern coding-theoretic techniques. Toward this end, 
we provide a characterization of the messages distribution passed on the graph via density evolution 
(DE) in Sec. IV. DE is an asymptotic analysis technique that was originally developed for MP 
decoding of error-correcting codes. Also, through the emphasis of simulations on cold-start settings, 
we see the cold start problem is greatly reduced by the IMP algorithm in comparison to other 
methods on real Netflix data com in Sec. V. 

II. Factor graph model 

A. Model Description 

Consider a collection of N users and M movies when the set O of user-movie pairs have been observed. 
The main theoretical question is, "How large should the size of O be to estimate the unknown ratings 
within some distortion <5?". Answers to this question certainly require some assumptions about the movie 
rating process as been studied by prior works [6], [7]. So we begin differently by introducing a probabilistic 
model for the movie ratings. The basic idea is that hidden variables are introduced for users and movies, 
and that the movie ratings are conditionally independent given these hidden variables. It is convenient to 
think of the hidden variable for any user (or movie) as the user group (or movie group) of that user (or 
movie). In this context, the rating associated with a user-movie pair depends only on the user group and 
the movie group. 

Let there be g u user groups, g v movie groups, and define [k] = {1,2,..., k}. The user group of the 
n-th user, U n € [g u ], is a discrete r.v. drawn from Pr(U n = u) = pu(u) and U = Ui, U2, • • • , Un is 
the user group vector. Likewise, the movie group of the m-th movie, V m € [g v ], is a discrete r.v. drawn 
from Pr(y m = v) = pv(v) and V = V±, V2, . . . , Vm is the movie group vector. Then, the rating of 
the m-th movie by the n-th user is a discrete r.v. R nm £ Ti (e.g., Netflix uses 1Z = [5]) drawn from 
Pr(R nm = r\U n = u,V m = v) = w(r\u,v) and the rating R nm is conditionally independent given the 
user group U n and the movie group V m . Let R denote the rating matrix and the observed submatrix be 
Ro with O C [N] x [M]. In this setup, some of the entries in the rating matrix are observed while others 
must be predicted. The conditional independence assumption in the model implies that 

Pr(R G |U,V)^ H w(R nm \U n ,V m ). 

(n,m)EO 
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Figure 1. The factor graph model for the collaborative filtering problem. The graph is sparse when there are few ratings. Edges 
represent random variables and nodes represent local probabilities. The node probability associated with the ratings implies that 
each rating depends only on the movie group (top edge) and the user group (bottom edge). Synthetic data can be generated by 
picking i.i.d. random user/movie groups and then using random permutations to associate groups with ratings. Note x^' and 
y^ l) are the messages from movie to user and user to movie during iteration i for the Algorithm 1. 



Specifically, we consider the factor graph (composed of 3 layers, see Fig. 1) as a randomly chosen 
instance of this problem based on this probabilistic model. The key assumptions are that these layers 
separate the influence of user groups, movie groups, and observed ratings and the outgoing edges from 
each user node are attached to movie nodes via a random permutation. 

The main advantage of our model is that, since it exploits the correlation in ratings based on simi- 
larity between users (and movies) and includes noise process, this model approximates real Netflix data 
generation process more closely than other simpler factor models. It is also important to note that this is 
a probabilistic generative model which allows one to evaluate different learning algorithms on synthetic 
data and compare the results with theoretical bounds (see Sec. V-B for details). 



B. Generalization Error Bound 

In this section, we consider bounds on generalization from partial knowledge of the (binary-rating) matrix 
for collaborative filtering application. The tighter bound implies one can use most of known ratings for 
learning the model completely. Since computation of R can be viewed as the product of three matrices, 
we consider the simplified class of tri-factorized matrices Xg u ,gv as > 

{X\X = U T WV,U G [0, 1} 9 - XN ,V & [0, ir xM ,W € {±1} 9 " X9 "}. 

We bound the overall distortion between the entire predicted matrix X and the true matrix Y as a function 
of the distortion on the observed set of size \0\ and the error e. Let y € {±1} be binary ratings and 
define a zero-one sign agreement distortion as 

d{x,y)±[ l lfxy ~°. 

I otherwise 
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Also, define the average distortion over the entire prediction matrix as 

D(X,Y)± £ d(x,y)/NM 

(n,m)e[N]x[M] 

and the averaged observed distortion as 

D (X, Y)± d(x,y)/\0\. 

(n,m)£0 

Theorem 1: For any matrix Y G {±l} NxM , N, M > 2, 5 > and integers g u and g v , with probability 
at least 1 — 5 over choosing a subset O of entries in Y uniformly among all subsets of \0\ entries 
VX € Xgu,9„> \ D ( x , Y ) ~ D o (X, Y) | is upper bounded by 



(Ng u + Mg v + g u g v ) log , 12eM - log5 ^ /2|0| = h (g u , g v , N, M, \0\) . 

Proof: The proof of this theorem is given in Appendix A. ■ 
Let us finish this section with two implications of the Thm. 1 in terms of the five parameters: 

g u , g v , N, M, \0\. 

1) For fixed group numbers g u and g v , as number of users N and movies M increases, to keep the 
bound tight, number of observed ratings \0\ also needs to grow in the same order. 

2) For a fixed sized matrix, when the choice of g u and/or g v increases, \0\ needs to grow in the same 
order to prevent over-learning the model. Also, as \0\ increases, we could increase the value of g u 
and/or g v . 



III. Learning Algorithms 

A. Message Passing (MP) Learning 

Once a generative model describing the data has been specified, we describe how two algorithms can be 
applied in the model using a unified cost function, the free energy. Since exact learning and inference 
are often intractable, so we turn to approximate algorithms that search distributions that are close to 
the correct posterior distribution by minimizing pseudo-distances on distributions, called free energies 
by statistical physicists. The problem can be formulated via message-passing (also known as belief 
propagation) framework via the sum-product algorithm since fixed points of (loopy) belief propagation 
correspond to extrema of the Bethe approximation of the free energy [17]. The basic idea is that the local 
neighborhood of any node in the factor graph is tree-like, so that belief propagation gives a nearly optimal 
estimate of the a posteriori distributions. We denote the message from movie m to user n during iteration 
% by Xm-rn and the message from user n to movie m by yn\ m - The set of all users whose rating movie 
m was observed is denoted U m and the set of all movies whose rating by user n was observed is denoted 
V n . The exact update equations are given in Algorithm 1. Though the idea is similar to an EM update, 
the resulting equation are different and seem to perform much better. 



B. Expectation Maximization (EM) Learning 

Now, we reformulate the problem in a standard variational EM framework and propose a second algorithm 
by minimizing an upper bound on the free energy [2]. In other words, we view the problem as maximum- 
likelihood parameter estimation problem where pu n {')> Pv m {'), an d Pr\u,m{'\') are trie model parameters 
9 and U, V are the missing data. For each of these parameters, the i-th estimate is denoted fn\u), 
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Algorithm 1 IMP Algorithm 



Step I: Initialization 

x£U n (v) = -x$(v)=pv(v), y^l m (u) = y { n ) (u)=p u (u), w(r\u,v) 
Step II: Recursive update 

yn ] ( U ) Jl J2 W (. r n,m\u,v)^\ n {v) 



2>»V) II Y. w ^m\w,v)4L(v) 

W keV n \m v 



X (i+V( v )- keU m \nu 



Step III: Output 



v ' keU m \n u 

y^Jn^m{u)xlr^l(v)w (r\u, v) 

Y^^ynXm(u)yL^l(v)w (r\u, v) 

r u,v 

yi 0) (u) Y[ (r n , m \u, v) x%\ n (v) 

( u )- fcgv " " 

u' keV n v 

(°)^TTY^„^ i„. „,\„M 



x m (u) II (r n , m |u, u) 



hm(v), and u;W(r|«, v). Let O C [iV] x [M] be the set of user-movie pairs that have been observed. 
Then, we can write the complete data (negative) log-likelihood as 

R c (0) = -log ] [ Pr (R nm = r njm , U n = u n , V m = v m ) 

(n,m)£0 

= _l0 § I I W ( r n,m\Un, V m ) /„ (li„) h m (v m ) . 
(n,m)GO 

Using a variational approach, this can be upper bounded by 

^ D {Qu n ,V n \R nm {-i ■\ r n,m)\\Pu n ,V rn \R nm {- -\ r n,m)) , 
(n,m)GO 
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Algorithm 2 EM Learning Algorithm 



Step I: Initialization 

ti°\u)= Pu (u), h$(v) = Pv (v), w<® (r\u,v) 

Step II: Recursive update 

J2fn\u) J2 w(i) ( r n,m\u,v)h^(v) 
ft l) (u) = ^ 



Step III: Output 



y] y] fn^ (u) ^ w{i) {fn,m\u,v) h$ (v) 
u'£{g u ]meV n v€[g m ] 

E h ™ ( V ) E U,(i) ( r «.™l U ' V ) fn } ( U ) 
h (W) ( v ) = ^ ^ 

v'€[gv]n£U m u<=[g u ] 

E *> W (rn,m\u,v) fi i+1 \u)hg +1 \v) 
(i+1) / I \ _ (n,m):r„, m =r 

E E ^ W (rn, m |«^)/^ +1) (n)^ +1) M 

re7£(n,m):r„ im =r 

X>n +1) («)^ +1) (^)^ (i+1) (r|«,«) 

p ^- |Ro (r) = EE /f » +1) w^ +1) ^ u,(i+1) (rM 

p£iiL(«)=/r i) («) 



where we introduce the variational probability distributions Q[/„,y m |R nm (u, u|r) that satisfy 

^2Qu n ,v m \R nm (u,v\r) = 1 

u,v 

and let 

/in w yrn,m\U, V) f n (u) h m {v) 

PU n ,V m \R nm { u ' v \ r ) = 



Hu',v' W ( r n,m\u', V 1 ) f n (u') h m (v f ) ' 

The variational EM algorithm we have developed uses alternating steps of KL divergence minimization 
to estimate the underlying generative model [1]. The results show that this variational approach gives 
the equivalent update rule as the standard EM framework (with a simpler derivation in Appendix B) 
which guarantees convergence to local minima. The update equations are presented in Algorithm 2. 
This learning algorithm, in fact, extends Thomas Hofmann's work and generalizes probabilistic matrix 
factorization (PMF) results [3], [15]. Its main drawback is that it is difficult to analyze because the effects 
of initial conditions and local minima can be very complicated. 
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Algorithm 3 VDVQ Clustering Algorithm via GLA Splitting (shown only for users) 
Step I: Initialization 

Let i = j = and cS' ' (0) be the average rating of movie m. 

Step II: Splitting of critics 
Set ' 

c% j) (u) u = 0,...,2*-l 

c^\u-2 l )+zg +1 ' 3 \u) u = 2\ . . . ,2* +1 -l 



C (^)(u): 



where the zin' 1 ' J \u) are i.i.d. random variables with small variance. 

Step III: Recursive soft K-means clustering for Cm^iu) for j = 1, . . . , J. 
1. Each training data is assigned a soft degree of assignment 7r n (u) to each of the critics using 

exp(-/3d(Ro,c^ } ( U ))) 

*<«>(«)- 



exp(-/3d(R ,c^V 

we\gu] 



where d (R 0l c^ j) (n)j = y E(n,m)eO ( c nm( n ) ~ r n,m) , 5u = 2 t+1 . 

2. Update all critics as 

Step IV: Repeat Steps II and III until the desired number of critics g u is obtained. 
Step V: Estimate of w(r\u,v) 

After clustering users/movies each into user/movie groups with the soft group membership 7r„ (u) and 
7r m (v), compute the soft frequencies of ratings for each user/movie group pair as 



w(r\u, v) 



^2 K n (u) TT rn (v) 

(n,rn)eO:R nm =r 

^ ^ 7T n (u) 7f m (v) 

r&Tl(n,m)£0:R nm =r 



C. Prediction and Initialization 

Since the primary goal is the prediction of hidden variables based on observed ratings, the learning 
algorithms focus on estimating the distribution of each hidden variable given the observed ratings. In 
particular, the outputs of both algorithms (after i iterations) are estimates of the distributions for R nm , 
U n , and V rn . They are denoted, respectively, P^, +1 j Ro (r), P^ + ^ (u), and Py + ^ (v)- Using these, one 
can minimize various types of prediction error. For example, minimizing the mean-squared prediction 
error results in the conditional mean estimate 

r£TZ 
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While minimizing the classification error of users (and movies) into groups results in the maximum a 
posteriori (MAP) estimates 

flW = argmaxp[2 |Ro (u) t)W = argmaxp^ ^(v). 

Likewise, after u/W (r\u, v) converges, we can also make hard decisions on groups by the MAP estimates 
first and then compute rating predictions by 

While this should perform worse with exact inference, this may not be the case with approximate inference 
algorithms. 

Both iterative learning algorithms require proper initial estimates of a set of initial group (of the user and 
movie) probabilities /„ (u) , h^(v) and observation model, w^°\r\u,v) since randomized initialization 
often leads to local minima and poor performance. To cluster users (or movies), we employ a variable- 
dimension vector quantization (VDVQ) algorithm [10] and the standard codebook splitting approach 
known as the generalized Lloyd algorithm (GLA) to generate codebooks whose size is any power of 2. 
The VDVQ algorithm is essentially based on alternating minimization of the average distance between 
users (or movies) and codebooks (that contains no missing data) with the two optimality criteria: nearest 
neighbor and centroid rules only on the elements both vectors share. The group probabilities are initialized 
by assuming that the VDVQ gives the "correct" group with probability e = 0.9 and spreads its errors 
uniformly across all other groups. In the case of users, one can think of this Algorithm 3 as a "/c-critics" 
algorithms which tries to design k critics (i.e., people who have seen every movie) that cover the space 
of all user tastes and each user is given a soft "degree of assignment (or soft group membership)" to each 
of the critics which can take on values between and 1. 



IV. Density Evolution Analysis 

Density evolution (DE) is well-known technique for analyzing probabilistic message -passing inference 
algorithms that was originally developed to analyze belief-propagation decoding of error-correcting codes 
and was later extended to more general inference problems [13]. It works by tracking the distribution 
of the messages passed in the graph under the assumption that the local neighborhood of each node is 
a tree. While this assumption is not rigorous, it is motivated by the following lemma. We consider the 
factor graph for a randomly chosen instance of this problem. The key assumption is that the outgoing 
edges from each user node are attached to movie nodes via a random permutation. This is identical to 
the model used for irregular LDPC codes [12]. 

Lemma 1: Let Ni{v) denote the depth-/ neighborhood (i.e., the induced subgraph including all nodes 
within / steps from v) of an arbitrary user (or movie) node v. Let the problem size N become unbounded 
with M = j3N for /3 < 1, maximum degree djy, and depth-Z^ neighborhoods. One finds that if 

toilv <1_5 ' 

for some 5 > and all N, then the graph Mi(v) is a tree w.h.p. for almost all v as N — >■ oo. 

Proof: The proof follows from a careful treatment of standard tree-like neighborhood arguments as 
in Appendix C. ■ 
For this problem, the messages passed during inference consist of belief functions for user groups (e.g., 
passed from movie nodes to user nodes) and movie groups (e.g., passed form user nodes to movie 
nodes). The message set for user belief functions is M u = V([g u ]), where V(S) is the set of probability 
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/d 

£ / (G ((61, n), . . . , (b d , r d )- a) G B) ^ (u, da) J] £ i/W ( V , <%) u; fa |u, «) (1) 
ri,...,r d j=l v 

4 + %,A) = j ^(^((ai,r 1 ),...,(a d ,r d );6)G^)^°)( V ,d6)nE^ ) ( n ' da ^ w (^l u ' v ) < 2 > 



j=l u 



distributions over the finite set S 1 . Likewise, the message set for movie belief functions is M v = V([g v \). 
The decoder combines d user (resp. movie) belief-functions a± (■),..., a d (-) G ,M U (resp. 61 (•),..., 6^(0 € 
A4„) using 

„ , 7 \ a n?=i E„ a j ( u ) w ( r j K v ) 

F d (a 1 ,r 1 ,...,a d ,r d ;b)- 



G d (61, n, ...,b d ,r d ;a) 



E„ Ilj EuOiH^ ( r jk v ) 



E u «(«) Ilj E„M V ) W fah w ) ' 

Since we need to consider the possibility that the ratings are generated by a process other than the 
assumed model, we must also keep track of the true user (or movie) group associated with each belief 
function. Let fi^\u,A) (resp. i/Wfa 5)) be the probability that, during the i-th iteration, a randomly 
chosen user (resp. movie) message is coming from a node with true user group u (resp. movie group v) 
and has a user belief function a(-) G A C M. u (resp. movie belief function &(•) G B C .M^)- The DE 
update equations for degree d user and movie nodes, in the spirit of [13], are shown in equations (1) and 
(2) where I(x G A) is defined as a indicator function 



I(x G A) 



1 if x G A 
if a; £ A 



Like LDPC codes, we expect to see that the performance of Algorithm 1 depends crucially on the degree 
structure of the factor graph. Therefore, we let Aj (resp. Tj) be the fraction of user (resp. movie) nodes with 
degree j and define the edge degree distribution to be Aj = Kjj / Efe>i ^-kk (resp. pj = Tjj/ Efe>i Ffcfc). 
Averaging over the degree distribution gives the final update equations 



^\u,B) = ^X d ^ +1 \u,B) 
d>l 

v^Xv,A) = Y,P d 4 +1 \v,A). 



d>i 

d>l 

We anticipate that this analysis will help us understand the IMP algorithm's observed performance for 
large problems based on the success of DE for channel coding problems. 

V. Experimental Results 

A. Details of Datasets and Training 

The key challenge of collaborative filtering problem is predicting the preference of a user for a given 
item based only on very few known ratings in a way that minimizes some per-letter metric d(r, r') for 
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Figure 2. Remedy for the Cold-Start Problem: Each plot shows the RMSE on the validation set versus the average number of 
observations per user for Netflix datasets. Performance is compared with three different matrix completion algorithms (OptSpace 
[21], SET [22] and SVT [23]) and an algorithm that uses the average rating for each movie as the prediction. For IMP and EM, 
r^m i prediction formula is used. 
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Figure 3. Each plot shows the RMSE on the validation set versus the average number of observations per user for synthetic 
datasets. Performance is compared with an (analytical) lower bound on RMSE assuming known user and movie group. 
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ratings. To study this, we created two smaller datasets from the Netflix data by randomly subsampling 
user/movie/rating triples from the original Netflix datasets which emphasizes the advantages of MP 
scheme. This idea was followed from [15], [6]. 

• Netflix Dataset 1 is a matrix given by the first 5,000 movies and users. This matrix contains 280,714 
user/movie pairs. Over 15% of the users and 43% of the movies have less than 3 ratings. 

• Netflix Dataset 2 is a matrix of 5,035 movies and 5,017 users by selecting some 5,300 movies and 
7,250 users and avoiding movies and users with less than 3 ratings. This matrix contains 454,218 
user/movie pairs. Over 16% of the users and 41% of the movies have less than 10 ratings. 

To provide further insights into the quality of the proposed factor graph model and suboptimality of the 
algorithms by comparison with the theoretical lower bounds, we generated two synthetic datasets from 
the above partial matrices. The synthetic datasets are generated once with the learned density | Rq (r), 

Pi; \n (u), and pty | Ro ( u ) an ^ trien randomly subsampled as the partial Netflix datasets. 

• Synthetic Dataset 1 is generated after learning Netflix Dataset 1 with g u = 9v = 8. 

• Synthetic Dataset 2 is generated after learning Netflix Dataset 2 with g u =g v = lQ. 

Additionally, to evaluate the performance of different algorithms/models efficiently, we hide 1,000 ran- 
domly selected user/movie entries as a validation set from each dataset. Note that the choice of g u and 
g v to obtain synthetic datasets resulted in the competitive performance on this validation set, but not fully 
optimized. Simulations were performed on these partial datasets where the average number of observed 
ratings per user was varied between 1 and 30. The experimental results are shown in Fig. 2 and 3 and 
the performance is evaluated using the root mean squared error (RMSE) of prediction defined by 



B. Results and Model Comparisons 

While the IMP algorithm is not yet competitive on the entire Netflix dataset [9], however, it shows 
some promise for the recommender systems based on MP frameworks. In reality, we have discovered 
that MP approaches really do improve the cold-start problem. Here is a summary of observations we've 
learned from the simulation study. 

1) Improvement of the cold-start problem with MP algorithms: From Fig. 2 results on partial Netflix 
datasets, we clearly see while many methods perform similarly with large amounts of observed 
ratings, IMP is superior for small amounts of data. This better threshold performance of the IMP 
algorithm over the other algorithms does help reduce the cold start problem. This provides strong 
support to use MP approaches in standard CF systems. Also in simulations, we observe lower 
computational complexity of the IMP algorithm even though we have developed computationally 
efficient versions of our EM algorithm (see Appendix B). 

2) Comparison with low-rank matrix models: Our factor graph model is a probabilistic generalization 
of other low-rank matrix models. Similar asymptotic behavior (for enough measurements) between 
partial Netflix and synthetic dataset suggests that the factor graph model is a good fit for this 
problem. By comparing with the results in [6], we can support that the Netflix dataset is much well 
described by the factor graph model. Other than these advantages, each output group has generative 
nature with explicit semantics. In other words, after learning the density, we can use them to generate 
synthetic data with clear meanings. These benefits do not extend to low-rank matrix models easily. 
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VI. Conclusions 

For the Netflix problem, a number of researchers have used low-rank models that lead to learning methods 
based on SVD and principal component analysis (PCA) with a least squares flavor. Unlike these prior 
works, in this paper, we proposed the new factor graph model and successfully applied MP framework to 
the problem. First, we presented the IMP algorithm and used simulations to show its superiority over other 
algorithms by focusing on the cold-start problem. Second, we studied quality of the model by deriving 
the DE analysis with a generalization error bound and complementing these theoretical analyses with 
simulation results for synthetic data generated from the learned model. 

References 

[I] I. Csiszar and G. Tusnady. "Information Geometry and Alternating Minimization Procedures", in Statistics & Decisions, 
Supplement Issue, 1:205-237, 1984. 

[2] R. M. Neal, G. E. Hinton. "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants", in 

Learning in Graphical Models, 355-368, 1998. 
[3] T. Hofmann, "Probabilistic Latent Semantic Analysis", in Uncertainty in Artificial Intelligence. 1999. 
[4] J. Lafferty and L. Wasserman. "Challenges in Statistical Machine Learning", in Statistica Sinica, 16(2):307-323, 2006. 
[5] ACM SIGKDD KDD Cup and Workshop 2007. http://www.cs.uic.edu/~liub/KDD-cup-2007/proceedings.html 
[6] R. Keshavan, A. Montanari and S. Oh. Learning, "Learning Low Rank Matrices from O (n) Entries," in Proc. Allerton 

Conf. on Comm., Control and Computing, Monticello, Illinois, Sep. 2008. 
[7] S. T. Aditya, Onkar Dabeer and Bikash Kumar Dey, "A Channel Coding Perspective of Recommendation Systems," in 

Proc. 2009 IEEE Int'l. Symp. Information Theory, Seoul, Korea, Jun. 2009. 
[8] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques, Draft, 2008. 
[9] Netflix prize website: http://www.netflixprize.com 

[10] A. Das, A.V. Rao, and A. Gersho, "Variable-dimension Vector Quantization of Speech Spectra for Low-rate Vocoders", in 
Proc. Data Compression Conference, 1994. 

[II] David MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge, 2005. 

[12] T. Richardson and R. Urbanke, "The Capacity of Low-density Parity-check Codes under Message-passing Decoding", in 

IEEE Trans. Inform. Theory, vol. 47, pp. 599-618, Feb. 2001. 
[13] A. Montanari, "Estimating Random Variables from Random Sparse Observations", in Eur. Trans. Telecom, Vol. 19 (4), pp. 

385-403, April 2008. 

[14] S. Funk, "Netflix update: Try this at home" at http://sifter.org/~simon/journal/20061211.html 

[15] R. Salakhutdinov and A. Mnih, "Probabilistic Matrix Factorization", in Advances in Neural Information Processing Systems, 
20, MIT, 2008. 

[16] B.-H. Kim, "An Information-theoretic Approach to Collaborative Filtering", Technical Report, Texas A&M University, 
2009. 

[17] J. Yedidia, W.T Freeman and Y. Weiss, "Understanding Belief Propagation and Its Generalizations", in Advances in neural 

information processing systems, 13, MIT, 2001. 
[18] N. Alon, "Tools from Higher Algebra", in Handbook of Combinatorics, North Holland, 1995. 

[19] N. Srebro, N. Alon and T. Jaakkola, "Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices", 

in Advances in Neural Information Processing Systems, 17, 2005. 
[20] A. I. Schein, Al. Popescul, L. H. Ungar, D. M. Pennock, "Methods and Metrics for Cold-Start Recommendations", in 

Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information 

Retrieval, pp. 253-260, August 2002. 
[21] R. Keshavan, S. Oh, and A. Montanari, "Matrix completion from noisy entries", Arxiv preprint cs. IT/0906.2027, 2009 
[22] W. Dai, and O. Milenkovic, "SET: an algorithm for consistent matrix completion", Arxiv preprint cs. IT/0909.2705, 2009 
[23] J. Cai, E. Candes, and Z. Shen, "A singular value thresholding algorithm for matrix completion", Arxiv preprint 

math.OC/0810.3286, 2008 



14 



Appendix A 
Proof of Theorem 1 

This proof follows arguments of the generalization error in [19]. First, fix Y as well as X G R NxM . 
When an index pair (n, m) is chosen uniformly random, d(x n m , y n ,m) is a Bernoulli random vari- 
able with probability D (X, Y) of being one. If the entries of O are chosen independently random, 
\0\Do(X, Y) is binomially distributed with parameters \0\D(X, Y) and \0\e. Using Chernoff's in- 
equality, we get 

Pr (D (X, Y) > D (X, Y) + e) = Pr (\0\D (X, Y) < \0\D (X, Y) - \0\e) < e" 2| ° |e2 . 

Now note that d (x, y) only depends on the sign of xy, so it is enough to consider equivalence classes of 
matrices with the same sign patterns. Let / (N, M, g u , g v ) be the number of such equivalence classes. 
For all matrices in an equivalence class, the random variable Do (X, Y) is the same. Thus we take a 
union bound of the events {-^l-D (X, Y) > Dp (X, Y) + e} for e ach of these / (N, M, g u , g v ) random 

variables with the bound above and e = J ^°Sf( N ' M ^9^9^ l°g^ we ^ ave 



Pr U € Xs „ s , D (X, Y) > D (X, Y) + ^ I W *> - j < 6. 

Since any matrix X G Xg u ,9v can written as X = U T GV, to bound the number of sign patterns of 
X, f (N, M, g u , g v ), consider Ng u + Mg v + g u g v entries of U, G, V as variables and the NM entries 
of X as polynomials of degree three over these variables as 

9u 9v 

Xn,m = ^ ^ ^ ] u k t n ' 9k,l ' v l,m- 
k=l 1=1 

By the use of the bound in lemma 2, we obtain 

/ 4 e 3 MM \Ngu+Mgv+g u gv / \2eM \ N 9^+ M 9^+9^9^ 

f (N, M, g u , g v ) < < — 

V Ng u + Mg v + g u g v J \mm(g u , g v ) J 

This bound yields a factor of log — ■ 12eM — - in the bound and establishes the theorem. 

J 6 min(g„,£„) 

Lemma 2: [18] Total number of sign patterns of r polynomials, each of degree at most d, over q 
variables, is at most (8edr/q) q if 2r > q > 2. Also, total number of sign patterns of r polynomials with 
{±1} coordinates, each of degree at most d, over q variables, is at most (4edr/q) q if r > q > 2. 

Appendix B 
Derivation of Algorithm 2 

As the first step, we specify a complete data likelihood as 

Pr (yR nm — fn,mi U n — U ni Vm — ^m) — ^ (j'n,m\'U"ni V m ) f n (lln) h m (v m ) 

and the corresponding (negative) log-likelihood function can be written as 
R c (0) = -log ] [ Pr (R nm = r n , m , U n = u n , V m = v m ) 

(n,m)eO 

= - ^ [l°gw (r n>m \u n ,v m ) +log/„ (u n ) + \ogh m (v m )] 

(n,m)sO 

The variational EM algorithm now consists of two steps that are performed in alternation with a Q 
distribution to approximate a general distribution. 



15 



A. E-step 

Since the states of the latent variables are not known, we introduce a variational probability distribution 

Qu n ,v m \R nm (u,v\r) subject to E Qu n ,v m \R nm (u,v\r) = 1 

u,v 

for all observed pairs (n, m). Exploiting the concavity of the logarithm and using Jensen's inequality, we 
have 

R (8) = - E log^2Pi(R nm = r n:m ,U n = u n ,V m = v m ) 

(n,m)eO u,v 

E, / | x w (r n ,m\u, V) fn (u) h m (v) 
log > Qu n ,v m \R nm (u,v\r) — 

(n,m)eO u,v ^u n ,v m \n nm \ i \ / 

(n,m)eO u,v ^U n ,V m \K nm V > I I 

±R(9\Q)- E H(Q(-\u,v,r)) 

(n,m)gO 

= R(0; Q) 

To compute the tightest bound given parameters 6 i.e., we optimize the bound w.r.t the Qs using 

= 0. 



R(9-Q)+ £ E A «>^ 

(n,m)eO u,v 

These yield posterior probabilities of the latent variables, 

_ W (r n!m \u, v) f n (u) fl m (v) 

J2u',v> W ( r n,m\u', V') f n (u f ) h m (v') 

Also note that we can get the same result by Gibbs inequality as 



Pu n ,V m \R nm (u,v\r;e) = Q* Un ,v m \ Rnm (u,v\r;i 



R(0)<- E J2Qu n ,v m \R nm (u,v\r)log 



w (r njTn \u,v) f n (u) h m (v) 



Qu n ,V m \R nm (u,v\r) 

= Yj D {Qu n y n \R nm {-i^n,m)\\Pu n y m \R nm {-r\r n ,rn)) ■ 
(n,m)eO 

B. M-step 

Obviously the posterior probabilities need only to be computed for pairs (n, m) that have actually been 
observed. Thus optimize 

*(M) = - E E Q*u n y m \R nm (u,v\r;0j {log w (r n>m \u,v) + log /„(«)+ log h m (v)} 

(n,m)eO u,v 

w(r n ,m\u,v) f n {u)h m {v) 

with respect to parameters 6 which leads to the three sets of equations for the update of 

w(r\u,v), f n (u), h m (v). 
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Moreover, for large scale problems, to avoid computational loads of each step, combining both E and M 
steps by plugging Q function into M-step gives more tractable EM Algorithm. The resulting equations 
are defined in Algorithm 2. 



Appendix C 
Proof of Lemma 1 

Starting from any node v, we can recursively grow Mi + i(v) from Mi{v) by adding all neighbors at 
distance i + Let A4 be the number of outgoing edges from Ni{v) to the next level and \ . . . , b^n be 
the degrees of the n« available nodes that can be chosen in the next level. The probability that the graph 
remains a tree is 



J2sc[n],\S\=AiIlseS ^ 



A % 

where the numerator is the number of ways that the Ai edges can attach to distinct nodes in the next level 
and the denominator is the total number of ways that the Ai edges may attach to the available nodes. 
Using the fact that the numerator is an unnormalized expected value of the product of Ai 6's drawn 
without replacement, we can lower bound the numerator using 



Sc[n], \S\=Ai seS 



Ai J \ Tbi j Ai ! 



m 



This can be seen as lower bounding the expected value of Ai b's drawn from with replacement from a 
distribution with a slightly lower mean. Upper bounding the denominator by (n^)" 4, / 'AA gives 



p(A,b«) > 



riibi) 'Ail 



riij \ birii 



> II- 

ni bin. 

Now, we can take the product from i = 0, ... ,1 — 1 to get 

l-i 

Pr {Mi{v ) is a tree) = JJPr (Ni+\{v) is a tree|A/o(w), . . . ,Mi{v) are trees) 

i=0 

>Jj(l-^ AKd ~ 1] 



i=0 



rii bin 



i=0 



ni bim 



1 N /_^ + d»(d-l) 



> 1- 1 + 



d 2 -lj \{3N-d l (3N-d l 
1 \ d 2l+1 



d 2 -I J (3N -d v 
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because A { < d i+ \ £££ A ? < d 2 ^ = d 21 (l + and > /3iV - £}=o^ > /3iV - cf +1 . 

Examining the expression 

J21+1 

log p N _ d i < (21n + 1) log d N - log N + 0(1) < -5logN + 0(l) 

shows that the probability of failure is O (N~ s ) . 

Let Z be a r.v. whose value is the number of user nodes whose depth-/ neighborhood is not a tree. We 
can upper bound the expected value of Z with 

With Markov's inequality, one can show that 



/ i a/ 2 \ EZ] OiN 1 - 6 ) 

V - ; - /v 1 - 5 / 2 ~ TV 1 -' 5 / 2 



Therefore, the depth-Z neighborhood is a tree (w.h.p. as N — > oo) for all but a vanishing fraction of user 
nodes. 



